1 Company

1.0.1 The TopBank Company:

A banking services company with customers mainly in Europe.
It offers financial products such as bank accounts, investments, and insurance.
Business model: banking services through physical and online branches.
Main product: a bank account free of charge, valid for 12 months; after that period, the account must be renewed.
Bank account revenue per customer:

  • 15% of the client’s estimated salary, for clients with estimated income below the average.
  • 20% of the client’s estimated salary, for clients with estimated income above the average.
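The revenue rule above can be sketched as a small function; `revenue_per_customer`, the example salaries, and the average income are hypothetical illustrations, not values from the company:

```r
# Revenue rule: 15% of the estimated salary for clients below the average
# income, 20% for clients above it.
revenue_per_customer <- function(estimated_salary, avg_income) {
  rate <- ifelse(estimated_salary < avg_income, 0.15, 0.20)
  estimated_salary * rate
}

revenue_per_customer(c(50000, 150000), avg_income = 100000)
# 15% of 50,000 = 7,500; 20% of 150,000 = 30,000
```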

1.0.2 Problem:

  • The rate of customer cancellation has increased significantly in recent months.

1.0.3 Goal:

  • Reduce customer churn, that is, prevent the customer from canceling the contract and not renewing it for another 12 months.

1.0.4 Deliverable:

A report on the model's performance and results, covering the following topics:

  • What is the company’s current turnover rate?

  • How does the churn rate vary by month?

  • How well does the model perform at identifying customers as churn?

  • What is the company’s expected revenue if, using the developed model, it prevents customers from canceling their contracts?

Action Plan:

  1. Offer a discount coupon or other financial incentive.
  2. Which customers could receive the incentive and at what cost, in order to maximize ROI (Return on Investment)?
    (The sum of the incentives must not exceed $10,000.00.)
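Step 2 can be prototyped as a greedy allocation: rank customers by revenue at risk (churn probability times revenue per customer) and hand out fixed-value coupons until the budget is exhausted. This is a hypothetical sketch under assumed inputs; `allocate_incentives`, the $200 coupon value, and the simulated probabilities are illustrations, not the project's final method:

```r
# Greedy sketch: target the customers with the highest expected revenue loss
# (churn probability x revenue), without exceeding the fixed budget.
allocate_incentives <- function(churn_prob, revenue, incentive = 200, budget = 10000) {
  expected_loss <- churn_prob * revenue              # revenue at risk per customer
  ranked <- order(expected_loss, decreasing = TRUE)  # riskiest customers first
  n_coupons <- min(floor(budget / incentive), length(churn_prob))
  ranked[seq_len(n_coupons)]                         # indices of the selected customers
}

set.seed(42)
selected <- allocate_incentives(churn_prob = runif(100),
                                revenue = runif(100, 5000, 30000))
length(selected)  # 50 coupons of $200 stay within the $10,000.00 cap
```

A production version could instead solve a knapsack or linear-programming problem to vary the coupon value per customer.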

2 Importing Needed Packages

library(dplyr)
library(data.table)
library(ggplot2)
library(plotly)
library(vcd)
library(grid)
library(tidymodels)
library(treesnip)
library(lightgbm)

ggplot2::theme_set(ggplot2::theme_minimal())

2.1 Helper Functions

feature_engineering <- function(df){
  
  # derive ratio and interaction features; tenure_year = tenure + 1 avoids
  # division by zero for customers with tenure 0
  df <- df %>% dplyr::mutate(tenure_year = tenure + 1,
                             age_ten_year = age / tenure_year,
                             cred_ten_year = credit_score / tenure_year,
                             cred_age = credit_score / age,
                             amount = estimated_salary + balance,
                             amount_credit = amount / credit_score,
                             amount_ten_year = amount / tenure_year,
                             amount_prod = amount / num_of_products,
                             cred_prod = credit_score / num_of_products,
                             bal_ten_year = balance / tenure_year,
                             prod_m_cr = num_of_products - has_cr_card,
                             prod_t_cr = num_of_products * has_cr_card)
  return(df)
}


# Pairwise association matrix (Cramer's V, phi, or contingency coefficient)
# for the categorical columns of x.
catcor <- function(x, type=c("cramer", "phi", "contingency")) {
    require(vcd)
    nc <- ncol(x)
    v <- expand.grid(1:nc, 1:nc)
    type <- match.arg(type)
    res <- matrix(mapply(function(i1, i2) assocstats(table(x[,i1],
        x[,i2]))[[type]], v[,1], v[,2]), nc, nc)
    rownames(res) <- colnames(res) <- colnames(x)
    res
}


minmax_scaler <- function(x) {
   return( ( x - min(x) ) / ( max(x) - min(x) ) )
}

robust_scaler <- function(x){
  return( ( x - quantile(x, 0.5) ) / ( quantile(x, 0.75) - quantile(x, 0.25) ) )
}
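A quick sanity check of the two scalers above (restated in one line each so the snippet runs standalone): min-max maps the sample onto [0, 1], while the robust scaler centers the median at 0 and divides by the interquartile range.

```r
minmax_scaler <- function(x) ( x - min(x) ) / ( max(x) - min(x) )
robust_scaler <- function(x) ( x - quantile(x, 0.5) ) / ( quantile(x, 0.75) - quantile(x, 0.25) )

x <- c(10, 20, 30, 40, 50)
minmax_scaler(x)  # 0.00 0.25 0.50 0.75 1.00
robust_scaler(x)  # -1.0 -0.5 0.0 0.5 1.0  (median 30 -> 0, IQR = 20)
```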

ml_error <- function(model_name = "Logistic Regression Model", model_predictions){
  # compute the precision of the predictions against the actual labels
  precision <- model_predictions %>%
    yardstick::precision(actual, predictions)
  return(precision)
}

2.2 Reading the data

data_raw <- data.table::fread("/home/renato/repos/churn/data/churn.xls")

3 Description of Data

3.1 Rename Columns

data_raw <- data_raw %>% janitor::clean_names() 

head(data_raw)
   row_number customer_id  surname credit_score geography gender age tenure
1:          1    15634602 Hargrave          619    France Female  42      2
2:          2    15647311     Hill          608     Spain Female  41      1
3:          3    15619304     Onio          502    France Female  42      8
4:          4    15701354     Boni          699    France Female  39      1
5:          5    15737888 Mitchell          850     Spain Female  43      2
6:          6    15574012      Chu          645     Spain   Male  44      8
     balance num_of_products has_cr_card is_active_member estimated_salary
1:      0.00               1           1                1        101348.88
2:  83807.86               1           0                1        112542.58
3: 159660.80               3           1                0        113931.57
4:      0.00               2           0                0         93826.63
5: 125510.82               1           1                1         79084.10
6: 113755.78               2           1                0        149756.71
   exited
1:      1
2:      0
3:      1
4:      0
5:      0
6:      1

3.2 Features Description

RowNumber - The number of the row.

CustomerID - Customer’s unique identifier.

Surname - Customer’s surname.

CreditScore - Customer’s credit score for the consumer market.

Geography - The country where the customer lives.

Gender - Customer’s gender.

Age - Customer’s age.

Tenure - Number of years that the customer has been active.

Balance - The amount that the customer has in the bank account.

NumOfProducts - The number of products bought by the customer.

HasCrCard - Flag that indicates if the customer has a credit card.

IsActiveMember - Flag that indicates if the customer has done a bank activity in the last 12 months.

EstimatedSalary - Estimate of the customer’s monthly income.

Exited - Flag that indicates whether the customer churned.

3.3 Data Dimensions

print(paste("Number of Rows: " ,nrow(data_raw)))
[1] "Number of Rows:  10000"
print(paste("Number of Cols: " ,ncol(data_raw)))
[1] "Number of Cols:  14"

3.4 Data Types

glimpse(data_raw)
Rows: 10,000
Columns: 14
$ row_number       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
$ customer_id      <int> 15634602, 15647311, 15619304, 15701354, 15737888, 15…
$ surname          <chr> "Hargrave", "Hill", "Onio", "Boni", "Mitchell", "Chu…
$ credit_score     <int> 619, 608, 502, 699, 850, 645, 822, 376, 501, 684, 52…
$ geography        <chr> "France", "Spain", "France", "France", "Spain", "Spa…
$ gender           <chr> "Female", "Female", "Female", "Female", "Female", "M…
$ age              <int> 42, 41, 42, 39, 43, 44, 50, 29, 44, 27, 31, 24, 34, …
$ tenure           <int> 2, 1, 8, 1, 2, 8, 7, 4, 4, 2, 6, 3, 10, 5, 7, 3, 1, …
$ balance          <dbl> 0.00, 83807.86, 159660.80, 0.00, 125510.82, 113755.7…
$ num_of_products  <int> 1, 1, 3, 2, 1, 2, 2, 4, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2…
$ has_cr_card      <int> 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1…
$ is_active_member <int> 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1…
$ estimated_salary <dbl> 101348.88, 112542.58, 113931.57, 93826.63, 79084.10,…
$ exited           <int> 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0…

3.5 Checking NA

colSums(is.na(data_raw))
      row_number      customer_id          surname     credit_score 
               0                0                0                0 
       geography           gender              age           tenure 
               0                0                0                0 
         balance  num_of_products      has_cr_card is_active_member 
               0                0                0                0 
estimated_salary           exited 
               0                0 
  • The data set has no missing values.

3.6 Change Types

data_raw <- data_raw %>% 
  dplyr::mutate_if(is.character, as.factor)

3.7 Descriptive Statistics

# selecting only numeric features
num_attributes <- data_raw %>% 
  purrr::keep(is.numeric)

# selecting only categorical features
cat_attributes <- data_raw %>% 
  purrr::keep(is.factor)

3.7.1 Numeric Attributes

# Central Tendency  - mean , median
num_mean <- as.data.frame( t(lapply(num_attributes, mean)))

num_median <- as.data.frame( t(lapply(num_attributes, median)))

# dispersion - std, min, max, range, skew, kurtosis
num_std <- as.data.frame( t(lapply(num_attributes, sd)))

num_min <- as.data.frame( t(lapply(num_attributes, min)))

num_max <- as.data.frame( t(lapply(num_attributes, max)))

num_skew <- as.data.frame( t(lapply(num_attributes, e1071::skewness)))

num_kurt <- as.data.frame( t(lapply(num_attributes, e1071::kurtosis)))

table_desc <- t(bind_rows(num_min,num_max,num_mean,num_median,num_std,num_skew,num_kurt))

table_desc<- as.data.frame(table_desc)

names(table_desc) <- c("min","max","mean","median","std","skew", "kurtosis")

knitr::kable(table_desc, digits = 4) %>% 
  kableExtra::kable_styling(full_width = F, bootstrap_options = c("striped", "hover", "condensed", "responsive"), html_font = "Cambria")
                        min        max        mean     median         std         skew    kurtosis
row_number                1      10000      5000.5     5000.5    2886.896            0    -1.20036
customer_id        15565701   15815690    15690941   15690738    71936.19  0.001148801   -1.196475
credit_score            350        850    650.5288        652     96.6533  -0.07158513  -0.4266275
age                      18         92     38.9218         37    10.48781     1.011017    1.393171
tenure                    0         10      5.0128          5    2.892174   0.01098816    -1.16561
balance                   0   250898.1    76485.89   97198.54    62397.41   -0.1410664   -1.489569
num_of_products           1          4      1.5302          1   0.5816544    0.7453442    0.581373
has_cr_card               0          1      0.7055          1   0.4558405   -0.9015411   -1.187342
is_active_member          0          1      0.5151          1   0.4997969   -0.0604185   -1.996549
estimated_salary      11.58   199992.5    100090.2   100193.9    57510.49  0.002084732   -1.181891
exited                    0          1      0.2037          0   0.4027686     1.471169   0.1643553

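The same kind of summary table can be built in a single pass with base R's `sapply`, instead of the repeated `t(lapply(...))` and `bind_rows` calls. `describe_num` is a hypothetical helper; skewness and kurtosis (from `e1071`) are omitted to keep the snippet dependency-free:

```r
# One row of statistics per numeric column: rows are features, columns are stats.
describe_num <- function(df) {
  t(sapply(df, function(x) c(min = min(x),
                             max = max(x),
                             mean = mean(x),
                             median = median(x),
                             std = sd(x))))
}

describe_num(data.frame(a = 1:5, b = c(2, 4, 6, 8, 10)))
# row "a": min 1, max 5, mean 3, median 3, std ~1.58
# row "b": min 2, max 10, mean 6, median 6, std ~3.16
```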
num_attributes %>%
  purrr::keep(is.numeric) %>% 
  tidyr::gather() %>% 
  ggplot2::ggplot(ggplot2::aes(value)) +
    ggplot2::facet_wrap(~ key, scales = "free") +
    ggplot2::geom_histogram(col= "black", fill="steelblue", bins = 25)+
    ggplot2::scale_y_continuous(labels = function(x) format(x, scientific = FALSE))+
  ggplot2::labs(title = "Distribution of numerical variables")+
  ggplot2::scale_x_continuous(labels = function(x) format(x, scientific = FALSE))+
  ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5, size = 18))

data_raw %>%
  mutate(exited = as.factor(exited)) %>% 
  dplyr::count(exited) %>% 
  mutate(prop = round(n/sum(n)*100,2)) %>% 
  ggplot2::ggplot(ggplot2::aes(x = exited, y = prop, color = exited)) +
  ggsci::scale_color_jco() +
  ggplot2::geom_segment(ggplot2::aes(xend = exited, yend = 0), show.legend = F) +
  ggplot2::geom_point(ggplot2::aes(size = prop), show.legend = F) +
  ggplot2::geom_label(ggplot2::aes(label = paste0(prop,"% / n = ",n)),
             fill = "white", 
             hjust = "inward",
             show.legend = F) +
  ggplot2::labs(y = "%",
       x = "Exited") +
  ggplot2::coord_flip() +
  ggplot2::theme_minimal()

The company’s current cancellation rate is 20%.

3.7.2 Categorical Attributes

apply(cat_attributes, 2, function(x) length(unique(x)))
  surname geography    gender 
     2932         3         2 
ggpubr::ggarrange(
data_raw %>%
  dplyr::count(gender) %>% 
  mutate(prop = round(n/sum(n)*100,2)) %>% 
  ggplot2::ggplot(ggplot2::aes(x = gender, y = prop, color = gender)) +
  ggsci::scale_color_jco() +
  ggplot2::geom_segment(ggplot2::aes(xend = gender, yend = 0), show.legend = F) +
  ggplot2::geom_point(ggplot2::aes(size = prop), show.legend = F) +
  ggplot2::geom_label(ggplot2::aes(label = paste0(prop,"% / n = ",n)),
             fill = "white", 
             hjust = "inward",
             show.legend = F) +
  ggplot2::labs(y = "%",
       x = "Gender", title = "Gender") +
  ggplot2::coord_flip() +
  ggplot2::theme_minimal()+
  ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5, size = 18)),

data_raw %>%
  dplyr::count(geography) %>% 
  mutate(prop = round(n/sum(n)*100,2)) %>% 
  ggplot2::ggplot(ggplot2::aes(x = geography, y = prop, color = geography)) +
  ggsci::scale_color_jco() +
  ggplot2::geom_segment(ggplot2::aes(xend = geography, yend = 0), show.legend = F) +
  ggplot2::geom_point(ggplot2::aes(size = prop), show.legend = F) +
  ggplot2::geom_label(ggplot2::aes(label = paste0(prop,"% / n = ",n)),
             fill = "white", 
             hjust = "inward",
             show.legend = F) +
  ggplot2::labs(y = "%",
       x = "Geography", title = "Geography") +
  ggplot2::coord_flip() +
  ggplot2::theme_minimal()+
  ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5, size = 18)),

ncol = 2)

Conclusions:

  • The average age of customers is 38 years.

  • Balance has an almost normal distribution, with most customers holding between 100,000.00 and 130,000.00 euros in their bank accounts.

  • Most customers have a credit score between 400 and 700.

  • Customers earn, on average, 100,000.00 euros per month.

  • Most customers have a credit card.

  • Customers who have and have not done a banking transaction in the past 12 months are almost evenly split.

  • Most customers own at least one bank product.

  • Customers have been active at the bank for 5 years on average.

  • Only 20% of customers are churn.

  • 55% of customers are male.

  • 50% of customers are from France.

# number of customers with zero account balance
num_attributes %>% 
  filter(balance == 0) %>% 
  count()
      n
1: 3617

4 Feature Engineering

4.1 Mind Map

knitr::include_graphics("/home/renato/repos/churn/img/Churn.png")

4.1.1 Customers Hypotheses

1 Customers with higher wages should have a higher churn rate.

2 Customers with a low level of satisfaction are more likely to churn.

3 Customers with a low bank account balance are more likely to churn.

4 Younger customers should cancel services more often.

4.1.2 Gender Hypotheses

1 Male customers are more likely to churn.

4.1.3 Temporal

1 The greater the number of active years the customer has, the lower the risk of churn.

4.1.4 Geography

1 The churn rate should be higher for customers from France.

4.1.5 Products

1 Customers with only 1 product should experience higher churn.

4.1.6 Final Hypothesis List

1 Customers with higher wages should have a higher churn rate.

2 Customers with a low bank account balance are more likely to churn.

3 Younger customers should cancel services more often.

4 Male customers are more likely to churn.

5 The greater the number of active years the customer has, the lower the risk of churn.

6 The churn rate should be higher for customers from France.

7 Customers with only 1 product should experience higher churn.

df1 <- data_raw

head(df1) %>% knitr::kable() %>% 
  kableExtra::kable_styling(full_width = F, bootstrap_options = c("striped", "hover", "condensed", "responsive"), html_font = "Cambria")
row_number customer_id  surname credit_score geography gender age tenure   balance num_of_products has_cr_card is_active_member estimated_salary exited
         1    15634602 Hargrave          619    France Female  42      2      0.00               1           1                1        101348.88      1
         2    15647311     Hill          608     Spain Female  41      1  83807.86               1           0                1        112542.58      0
         3    15619304     Onio          502    France Female  42      8 159660.80               3           1                0        113931.57      1
         4    15701354     Boni          699    France Female  39      1      0.00               2           0                0         93826.63      0
         5    15737888 Mitchell          850     Spain Female  43      2 125510.82               1           1                1         79084.10      0
         6    15574012      Chu          645     Spain   Male  44      8 113755.78               2           1                0        149756.71      1

# Create new features
df1 <- feature_engineering(df1)
# Visualize new features
t(head(df1)) %>% knitr::kable() %>% 
  kableExtra::kable_styling(full_width = F, bootstrap_options = c("striped", "hover", "condensed", "responsive"), html_font = "Cambria")
row_number 1 2 3 4 5 6
customer_id 15634602 15647311 15619304 15701354 15737888 15574012
surname Hargrave Hill Onio Boni Mitchell Chu
credit_score 619 608 502 699 850 645
geography France Spain France France Spain Spain
gender Female Female Female Female Female Male
age 42 41 42 39 43 44
tenure 2 1 8 1 2 8
balance 0.00 83807.86 159660.80 0.00 125510.82 113755.78
num_of_products 1 1 3 2 1 2
has_cr_card 1 0 1 0 1 1
is_active_member 1 1 0 0 1 0
estimated_salary 101348.88 112542.58 113931.57 93826.63 79084.10 149756.71
exited 1 0 1 0 0 1
tenure_year 3 2 9 2 3 9
age_ten_year 14.000000 20.500000 4.666667 19.500000 14.333333 4.888889
cred_ten_year 206.33333 304.00000 55.77778 349.50000 283.33333 71.66667
cred_age 14.73810 14.82927 11.95238 17.92308 19.76744 14.65909
amount 101348.88 196350.44 273592.37 93826.63 204594.92 263512.49
amount_credit 163.7300 322.9448 545.0047 134.2298 240.6999 408.5465
amount_ten_year 33782.96 98175.22 30399.15 46913.32 68198.31 29279.17
amount_prod 101348.88 196350.44 91197.46 46913.32 204594.92 131756.24
cred_prod 619.0000 608.0000 167.3333 349.5000 850.0000 322.5000
bal_ten_year 0.00 41903.93 17740.09 0.00 41836.94 12639.53
prod_m_cr 0 1 2 2 0 1
prod_t_cr 1 0 3 0 1 2

5 EDA - Exploratory Data Analysis

df2 <- df1

5.1 Univariate Analysis

5.1.1 Response Variable

df2 %>%
  mutate(exited = as.factor(exited)) %>% 
  dplyr::count(exited) %>% 
  mutate(prop = round(n/sum(n)*100,2)) %>% 
  ggplot2::ggplot(ggplot2::aes(x = exited, y = prop, color = exited)) +
  ggsci::scale_color_jco() +
  ggplot2::geom_segment(ggplot2::aes(xend = exited, yend = 0), show.legend = F) +
  ggplot2::geom_point(ggplot2::aes(size = prop), show.legend = F) +
  ggplot2::geom_label(ggplot2::aes(label = paste0(prop,"% / n = ",n)),
             fill = "white", 
             hjust = "inward",
             show.legend = F) +
  ggplot2::labs(y = "%",
       x = "Exited", title = "Distribution Exited") +
  ggplot2::coord_flip() +
  ggplot2::theme_minimal()+
  ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5, size = 18))

df2 %>% 
  ggplot(aes(exited))+
  geom_histogram(fill= "steelblue",col="black")+
  scale_x_continuous(breaks = seq(0,1))

5.1.2 Distribution of Numerical Variables

num_attributes %>%
  purrr::keep(is.numeric) %>% 
  tidyr::gather() %>% 
  ggplot(aes(value)) +
    facet_wrap(~ key, scales = "free") +
    geom_histogram(col= "black", fill="steelblue", bins = 25)+
    scale_y_continuous(labels = function(x) format(x, scientific = FALSE))+
    scale_x_continuous(labels = function(x) format(x, scientific = FALSE))+
  labs(title = "Distribution of numerical variables")+
  theme(plot.title = element_text(hjust = 0.5, size = 18))

5.1.3 Categorical Variables

ggpubr::ggarrange(
df2 %>%
  dplyr::count(gender) %>% 
  mutate(prop = round(n/sum(n)*100,2)) %>% 
  ggplot2::ggplot(ggplot2::aes(x = gender, y = prop, color = gender)) +
  ggsci::scale_color_jco() +
  ggplot2::geom_segment(ggplot2::aes(xend = gender, yend = 0), show.legend = F) +
  ggplot2::geom_point(ggplot2::aes(size = prop), show.legend = F) +
  ggplot2::geom_label(ggplot2::aes(label = paste0(prop,"% / n = ",n)),
             fill = "white", 
             hjust = "inward",
             show.legend = F) +
  ggplot2::labs(y = "%",
       x = "Gender", title = "Gender") +
  ggplot2::coord_flip() +
  ggplot2::theme_minimal()+
  ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5, size = 18)),

df2 %>%
  dplyr::count(geography) %>% 
  mutate(prop = round(n/sum(n)*100,2)) %>% 
  ggplot2::ggplot(ggplot2::aes(x = geography, y = prop, color = geography)) +
  ggsci::scale_color_jco() +
  ggplot2::geom_segment(ggplot2::aes(xend = geography, yend = 0), show.legend = F) +
  ggplot2::geom_point(ggplot2::aes(size = prop), show.legend = F) +
  ggplot2::geom_label(ggplot2::aes(label = paste0(prop,"% / n = ",n)),
             fill = "white", 
             hjust = "inward",
             show.legend = F) +
  ggplot2::labs(y = "%",
       x = "Geography", title = "Geography") +
  ggplot2::coord_flip() +
  ggplot2::theme_minimal()+
  ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5, size = 18)),

ncol = 2)

5.2 Bivariate Analysis

df2 %>% 
  group_by(tenure) %>% 
  summarise(exited = sum(exited), .groups = "drop") %>%
 highcharter::hchart(
  'line', highcharter::hcaes(x = tenure, y = exited),
  color = "steelblue"
  ) 
round(cor(df2$tenure, df2$exited),2)
[1] -0.01

Weak and negative correlation.

The highest churn rate is in the first year, followed by the ninth year.

H1 Customers with higher wages should have a higher churn rate.

True. Customers with high salaries, from 140,000.00 to 200,000.00 euros, cancel contracts more often.

fig.2 <- df2 %>% 
  mutate(exited = as.factor(exited)) %>% 
  ggplot(aes(estimated_salary, fill = exited))+
  geom_density(alpha = 0.3)+
  labs(title = "Distribution Estimated Salary / Exited")+
  ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5, size = 18))+
  scale_x_continuous(breaks = seq(0,200000,10000))+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

ggplotly(fig.2) %>% layout(autosize = F, width = 900, height = 450)
round(cor(df2$estimated_salary, df2$exited),2)
[1] 0.01

Weak and positive correlation.

H2 Customers with a low bank account balance are more likely to churn.

False. Customers with higher account balances cancel contracts more often; this is visible starting from 80,000.00 euros.

fig.3 <- df2 %>% 
  mutate(exited = as.factor(exited)) %>% 
  ggplot(aes(balance, fill= exited))+
  geom_density(alpha= 0.3)+
  labs(title = "Distribution Balance / Exited")+
  ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5, size = 18))+
  scale_x_continuous(breaks = seq(0,250000,20000))

ggplotly(fig.3) %>% layout(autosize = F, width = 900, height = 450)
round(cor(df2$balance, df2$exited),2)
[1] 0.12

Weak and positive correlation.

H3 Younger customers should cancel services more often.

False. Customers begin canceling contracts more often after the age of 42.

fig.4 <- df2 %>% 
  mutate(exited = as.factor(exited)) %>% 
  ggplot(aes(age, fill= exited))+
  geom_density(alpha= 0.3)+
  labs(title = "Distribution Age / Exited")+
  ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5, size = 18))+
  scale_x_continuous(breaks = seq(0,80,3))

ggplotly(fig.4) %>% layout(autosize = F, width = 900, height = 450)
round(cor(df2$age, df2$exited),2)
[1] 0.29

Weak and positive correlation.

H4 Male customers are more likely to churn.

False. Male churners account for about 9% of the 10,000 customers, while female churners account for about 11%.

fig.5 <- df2 %>% 
  mutate(exited = as.factor(exited)) %>% 
  ggplot(aes(gender, fill= exited))+
  geom_bar(col = "black")+
  labs(title = "Distribution Gender / Exited")+
  ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5, size = 18))

ggplotly(fig.5) %>% layout(autosize = F, width = 900, height = 450)

Churn counts for male and female customers.

table(df2$gender, df2$exited)
        
            0    1
  Female 3404 1139
  Male   4559  898

Proportion of churn between males and females.

prop.table(table(df2$gender, df2$exited))*100
        
             0     1
  Female 34.04 11.39
  Male   45.59  8.98
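The table above shows joint proportions over all 10,000 customers, so the male and female cells are not directly comparable (there are more male customers overall). The churn rate within each gender comes from `prop.table(..., margin = 1)`; a sketch with hypothetical toy data:

```r
# margin = 1 normalizes each row, giving the churn rate within each gender.
gender <- c("Female", "Female", "Female", "Male", "Male")
exited <- c(1, 0, 1, 0, 0)
prop.table(table(gender, exited), margin = 1)
# Female churn rate = 2/3, Male churn rate = 0 in this toy data
```

Applied to the real table, this gives a churn rate of roughly 25% for female customers versus roughly 16% for male customers, reinforcing the conclusion that males churn less.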

H5 The greater the number of active years the customer has, the lower the risk of churn.

False. Over the years, the churn rate remains approximately constant.

fig.6 <- df2 %>% 
  mutate(exited = as.factor(exited),
         tenure = as.factor(tenure)) %>% 
  ggplot(aes(tenure, fill= exited))+
  geom_bar(col = "black")+
  labs(title = "Distribution Tenure / Exited")+
  ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5, size = 18))

ggplotly(fig.6) %>% layout(autosize = F, width = 900, height = 450)

Proportion of churn over the years for active customers.

(prop.table(table(df2$tenure, df2$exited))*100) 
    
        0    1
  0  3.18 0.95
  1  8.03 2.32
  2  8.47 2.01
  3  7.96 2.13
  4  7.86 2.03
  5  8.03 2.09
  6  7.71 1.96
  7  8.51 1.77
  8  8.28 1.97
  9  7.71 2.13
  10 3.89 1.01

H6 The churn rate should be higher for customers from France.

False. Germany has a higher churn share than France, although the difference is small.

fig.7 <- df2 %>% 
  mutate(exited = as.factor(exited)
        ) %>% 
  ggplot(aes(geography, fill= exited))+
  geom_bar(col = "black")+
  labs(title = "Distribution Geography / Exited")+
  ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5, size = 18))

ggplotly(fig.7) %>% layout(autosize = F, width = 900, height = 450)

Proportion of churn by country.

prop.table(table(df2$geography, df2$exited))*100
         
              0     1
  France  42.04  8.10
  Germany 16.95  8.14
  Spain   20.64  4.13

H7 Customers with only 1 product should experience higher churn.

True. Customers who have only one product have a higher churn rate.

fig.8 <- df2 %>% 
  mutate(exited = as.factor(exited)
        ) %>% 
  ggplot(aes(num_of_products, fill= exited))+
  geom_bar(col = "black")+
  labs(title = "Distribution Number of Products / Exited")+
  ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5, size = 18))

ggplotly(fig.8) %>% layout(autosize = F, width = 900, height = 450)  
round(cor(df2$num_of_products, df2$exited),2)
[1] -0.05

Weak and negative correlation.

5.3 Multivariate Analysis

5.3.1 Numerical Attributes

df2 %>% 
  purrr::keep(is.numeric) %>% 
  cor() %>% 
  ggcorrplot::ggcorrplot(hc.order = T,
             type = "lower",
             lab=T,
             lab_size = 3,
             method = "square",
             colors = c("chocolate1","white","darkcyan"),
             ggtheme = theme_minimal())

5.3.2 Categorical Attributes

df2 %>%
  purrr::keep(is.factor) %>% 
  as.data.frame() %>% 
  catcor(type="cramer") %>% 
  ggcorrplot::ggcorrplot(hc.order = T,
             type = "lower",
             lab=T,
             lab_size = 3,
             method = "square",
             colors = c("chocolate1","white","steelblue"),
             ggtheme = theme_minimal())

6 Data Preparation

6.1 Splitting Data

df2 <- df2 %>% mutate(exited = ifelse(exited == 1, "yes","no"),
                      exited= as.factor(exited))
set.seed(1234)

# split the data: 80% for training, 20% for testing, stratified by exited
data_split <- rsample::initial_split(df2, prop = 0.8, strata = exited)

# getting dataset of training
train_data <- rsample::training(data_split)

# getting dataset of testing
test_data <- rsample::testing(data_split)

# 10-fold cross-validation, stratified by exited
cv <- rsample::vfold_cv(train_data, strata = exited)

Statistical summary

# list of integer-type features

# "credit_score"      <- integer
# "age"               <- integer 
# "tenure"            <- integer
# "num_of_products"   <- integer
# "has_cr_card"       <- integer
# "is_active_member"  <- integer
# "prod_m_cr"         <- integer
# "prod_t_cr"         <- integer


df2 %>% 
  purrr::keep(is.integer) %>% select(-row_number, -customer_id) %>% summary()
  credit_score        age            tenure       num_of_products
 Min.   :350.0   Min.   :18.00   Min.   : 0.000   Min.   :1.00   
 1st Qu.:584.0   1st Qu.:32.00   1st Qu.: 3.000   1st Qu.:1.00   
 Median :652.0   Median :37.00   Median : 5.000   Median :1.00   
 Mean   :650.5   Mean   :38.92   Mean   : 5.013   Mean   :1.53   
 3rd Qu.:718.0   3rd Qu.:44.00   3rd Qu.: 7.000   3rd Qu.:2.00   
 Max.   :850.0   Max.   :92.00   Max.   :10.000   Max.   :4.00   
  has_cr_card     is_active_member   prod_m_cr        prod_t_cr   
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00  
 Median :1.0000   Median :1.0000   Median :1.0000   Median :1.00  
 Mean   :0.7055   Mean   :0.5151   Mean   :0.8247   Mean   :1.08  
 3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:2.00  
 Max.   :1.0000   Max.   :1.0000   Max.   :4.0000   Max.   :4.00  

Checking for outliers

df2 %>% 
  purrr::keep(is.integer) %>% select(-row_number, -customer_id, -is_active_member, -has_cr_card, -num_of_products) %>% 
  tidyr::gather() %>% 
  ggplot(aes(value)) +
    facet_wrap(~ key, scales = "free") +
    geom_boxplot(col = "black", fill = "steelblue") +
    scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
  labs(title = "Distribution of integer variables") +
  theme(plot.title = element_text(hjust = 0.5, size = 18)) 

Statistical summary

# list of features of type double

# "balance"           <- double
# "estimated_salary"  <- double
# "tenure_year"       <- double
# "age_ten_year"      <- double
# "cred_ten_year"     <- double
# "cred_age"          <- double
# "amount"            <- double
# "amount_credit"     <- double
# "amount_ten_year"   <- double
# "amount_prod"       <- double
# "cred_prod"         <- double
# "bal_ten_year"      <- double
 

df2 %>% 
  purrr::keep(is.double) %>% summary()
    balance       estimated_salary     tenure_year      age_ten_year   
 Min.   :     0   Min.   :    11.58   Min.   : 1.000   Min.   : 1.636  
 1st Qu.:     0   1st Qu.: 51002.11   1st Qu.: 4.000   1st Qu.: 4.400  
 Median : 97199   Median :100193.91   Median : 6.000   Median : 6.429  
 Mean   : 76486   Mean   :100090.24   Mean   : 6.013   Mean   : 9.456  
 3rd Qu.:127644   3rd Qu.:149388.25   3rd Qu.: 8.000   3rd Qu.:11.000  
 Max.   :250898   Max.   :199992.48   Max.   :11.000   Max.   :79.000  
 cred_ten_year       cred_age          amount         amount_credit     
 Min.   : 31.82   Min.   : 4.857   Min.   :    90.1   Min.   :  0.1356  
 1st Qu.: 76.38   1st Qu.:14.089   1st Qu.:117726.8   1st Qu.:179.2684  
 Median :108.59   Median :17.286   Median :177123.0   Median :271.0889  
 Mean   :157.59   Mean   :17.874   Mean   :176576.1   Mean   :277.8788  
 3rd Qu.:186.00   3rd Qu.:20.962   3rd Qu.:241020.1   3rd Qu.:375.9885  
 Max.   :850.00   Max.   :46.889   Max.   :407730.8   Max.   :984.7171  
 amount_ten_year     amount_prod       cred_prod      bal_ten_year   
 Min.   :    10.7   Min.   :    45   Min.   : 94.0   Min.   :     0  
 1st Qu.: 17674.0   1st Qu.: 68215   1st Qu.:323.0   1st Qu.:     0  
 Median : 29519.9   Median :122878   Median :443.0   Median : 13281  
 Mean   : 42998.4   Mean   :137535   Mean   :486.4   Mean   : 18731  
 3rd Qu.: 50322.6   3rd Qu.:199910   3rd Qu.:653.0   3rd Qu.: 24638  
 Max.   :373236.3   Max.   :400348   Max.   :850.0   Max.   :197042  

Checking for outliers

df2 %>% 
  purrr::keep(is.double) %>% 
  tidyr::gather() %>% 
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_boxplot(col = "black", fill = "steelblue") +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE)) +
  scale_x_continuous(labels = function(x) format(x, scientific = FALSE)) +
  labs(title = "Distribution of double numerical variables") +
  theme(plot.title = element_text(hjust = 0.5, size = 18))

# "surname"           <- factor
# "geography"         <- factor
# "gender"            <- factor 
# "exited"            <- factor
df2 %>% purrr::keep(is.factor)%>% summary()
     surname       geography       gender     exited    
 Smith   :  32   France :5014   Female:4543   no :7963  
 Martin  :  29   Germany:2509   Male  :5457   yes:2037  
 Scott   :  29   Spain  :2477                           
 Walker  :  28                                          
 Brown   :  26                                          
 Genovese:  25                                          
 (Other) :9831                                          

7 Rescaling

The rescaling methods applied below are chosen based on each feature's distribution shape and on the boxplot outlier analysis.


table_rescaling <- tibble(age = "outlier" , credit_score = "outlier" , prod_m_cr = "outlier" , 
                          prod_t_cr = "no outlier",tenure = "no outlier", age_ten_year = "outlier", 
                          amount = "no outlier", amount_credit = "outlier",amount_prod = "outlier", 
                          amount_ten_year = "outlier", bal_ten_year = "outlier" , balance = "no outlier",
                          cred_age = "outlier" , cred_prod = "no outlier",cred_ten_year = "outlier",
                          estimated_salary = "no outlier" ,tenure_year = "no outlier")

table_rescaling <- table_rescaling %>% tidyr::gather() %>% rename(features = key,
                                               actions = value)

knitr::kable(table_rescaling) %>% 
  kableExtra::kable_styling(full_width = F, bootstrap_options = c("striped", "hover", "condensed", "responsive"), html_font = "Cambria")
features actions
age outlier
credit_score outlier
prod_m_cr outlier
prod_t_cr no outlier
tenure no outlier
age_ten_year outlier
amount no outlier
amount_credit outlier
amount_prod outlier
amount_ten_year outlier
bal_ten_year outlier
balance no outlier
cred_age outlier
cred_prod no outlier
cred_ten_year outlier
estimated_salary no outlier
tenure_year no outlier
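The recipes below rescale outlier-heavy features with `robust_scaler()` and the rest with `minmax_scaler()`. These helpers are defined earlier in the report; for reference, minimal base-R sketches consistent with that usage (the exact definitions are an assumption):

```r
# robust scaling: center on the median and divide by the IQR,
# which is much less sensitive to outliers than z-score standardization
robust_scaler <- function(x) {
  (x - stats::median(x, na.rm = TRUE)) / stats::IQR(x, na.rm = TRUE)
}

# min-max scaling: rescale to the [0, 1] interval
minmax_scaler <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

minmax_scaler(c(0, 5, 10))   # 0.0 0.5 1.0
robust_scaler(1:5)           # -1.0 -0.5 0.0 0.5 1.0
```

This is why features flagged as "outlier" in the table get the robust variant: a single extreme value inflates the min-max range but barely moves the median and IQR.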
rec_imbalanced <- recipes::recipe(exited ~. , train_data %>% select(-surname)) %>%
  
  # Removing the identifier features row_number and customer_id (surname was dropped via select() above).
  recipes::step_rm(row_number, customer_id) %>% 
  
  # normalizing features
  recipes::step_mutate(age = robust_scaler(age)) %>% 
  recipes::step_mutate(credit_score = robust_scaler(credit_score)) %>% 
  recipes::step_mutate(prod_m_cr = robust_scaler(prod_m_cr)) %>% 
  recipes::step_mutate(prod_t_cr= minmax_scaler(prod_t_cr)) %>% 
  recipes::step_mutate(tenure = minmax_scaler(tenure)) %>% 
  recipes::step_mutate(age_ten_year = robust_scaler(age_ten_year)) %>% 
  recipes::step_mutate(amount = minmax_scaler(amount)) %>% 
  recipes::step_mutate(amount_credit = robust_scaler(amount_credit)) %>% 
  recipes::step_mutate(amount_prod = robust_scaler(amount_prod)) %>% 
  recipes::step_mutate(amount_ten_year = robust_scaler(amount_ten_year)) %>% 
  recipes::step_mutate(bal_ten_year = robust_scaler(bal_ten_year)) %>% 
  recipes::step_mutate(balance = minmax_scaler(balance)) %>% 
  recipes::step_mutate(cred_age = robust_scaler(cred_age)) %>% 
  recipes::step_mutate(cred_prod = minmax_scaler(cred_prod)) %>% 
  recipes::step_mutate(cred_ten_year = robust_scaler(cred_ten_year)) %>% 
  recipes::step_mutate(estimated_salary = minmax_scaler(estimated_salary)) %>% 
  recipes::step_mutate(tenure_year = minmax_scaler(tenure_year)) %>% 
  # turning categorical features into numerics.
  recipes::step_dummy(recipes::all_nominal(),- recipes::all_outcomes(), one_hot = T) 
rec_balanced <- recipes::recipe(exited ~. , train_data %>% select(-surname)) %>%
  
  # Removing the identifier features row_number and customer_id (surname was dropped via select() above).
  recipes::step_rm(row_number, customer_id) %>% 
  
  # normalizing features
  recipes::step_mutate(age = robust_scaler(age)) %>% 
  recipes::step_mutate(credit_score = robust_scaler(credit_score)) %>% 
  recipes::step_mutate(prod_m_cr = robust_scaler(prod_m_cr)) %>% 
  recipes::step_mutate(prod_t_cr= minmax_scaler(prod_t_cr)) %>% 
  recipes::step_mutate(tenure = minmax_scaler(tenure)) %>% 
  recipes::step_mutate(age_ten_year = robust_scaler(age_ten_year)) %>% 
  recipes::step_mutate(amount = minmax_scaler(amount)) %>% 
  recipes::step_mutate(amount_credit = robust_scaler(amount_credit)) %>% 
  recipes::step_mutate(amount_prod = robust_scaler(amount_prod)) %>% 
  recipes::step_mutate(amount_ten_year = robust_scaler(amount_ten_year)) %>% 
  recipes::step_mutate(bal_ten_year = robust_scaler(bal_ten_year)) %>% 
  recipes::step_mutate(balance = minmax_scaler(balance)) %>% 
  recipes::step_mutate(cred_age = robust_scaler(cred_age)) %>% 
  recipes::step_mutate(cred_prod = minmax_scaler(cred_prod)) %>% 
  recipes::step_mutate(cred_ten_year = robust_scaler(cred_ten_year)) %>% 
  recipes::step_mutate(estimated_salary = minmax_scaler(estimated_salary)) %>% 
  recipes::step_mutate(tenure_year = minmax_scaler(tenure_year)) %>% 
  # turning categorical features into numerics.
  recipes::step_dummy(recipes::all_nominal(),- recipes::all_outcomes(), one_hot = T) %>% 
  themis::step_smote(exited)
# Prepping each recipe on the training data only, to avoid data leakage
prep_imbalanced <- recipes::prep(rec_imbalanced, training = train_data)
prep_balanced   <- recipes::prep(rec_balanced, training = train_data)

# Imbalanced and pre-processed training data
train_data_imbalanced <- recipes::juice(prep_imbalanced)

# Imbalanced and pre-processed testing data
test_data_imbalanced <- recipes::bake(prep_imbalanced, new_data = test_data)

# Balanced (SMOTE) and pre-processed training data
train_data_balanced <- recipes::juice(prep_balanced)

# Balanced and pre-processed testing data (step_smote is skipped at bake time)
test_data_balanced <- recipes::bake(prep_balanced, new_data = test_data)
ggpubr::ggarrange(
train_data_imbalanced %>% 
  ggplot(aes(exited, fill= exited))+
  geom_bar(col="black")+
  labs(title = "Imbalanced Data")+
  theme(plot.title = element_text(hjust = 0.5, size = 18)), 

train_data_balanced %>% 
  ggplot(aes(exited, fill= exited))+
  geom_bar(col="black")+
  labs(title = "Balanced Data")+
  theme(plot.title = element_text(hjust = 0.5, size = 18)) , ncol = 2)

train_data_imbalanced %>%
  purrr::keep(is.numeric) %>% 
  tidyr::gather() %>% 
  ggplot2::ggplot(ggplot2::aes(value)) +
    ggplot2::facet_wrap(~ key, scales = "free") +
    ggplot2::geom_histogram(col= "black", fill="steelblue", bins = 25)+
    ggplot2::scale_y_continuous(labels = function(x) format(x, scientific = FALSE))+
  ggplot2::labs(title = "Distribution of numeric variables after pre-processing")+
  ggplot2::scale_x_continuous(labels = function(x) format(x, scientific = FALSE))+
  ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5, size = 18))

8 Feature Selection

8.1 Selection of features with Boruta

#boruta <- Boruta::Boruta(exited ~.,data = train_data_balanced %>% select(-row_number,-customer_id) ,doTrace =2 )

#saveRDS(boruta, "Boruta/boruta.rds")

boruta <- readRDS("Boruta/boruta.rds")

8.2 Boruta Statistics - Importance of Features

features_boruta <- Boruta::attStats(boruta) %>% arrange(desc(meanImp))

features_boruta <- as.data.frame(data.table::setDT(features_boruta, keep.rownames = "features"))

features_boruta
            features  meanImp medianImp   minImp   maxImp normHits  decision
1                age 75.62481  76.27921 71.55115 79.76610        1 Confirmed
2   is_active_member 58.01243  57.49542 56.77516 59.63506        1 Confirmed
3           cred_age 44.24553  44.69362 42.19361 45.38396        1 Confirmed
4    num_of_products 41.06186  41.63474 38.39232 42.39871        1 Confirmed
5            balance 39.06594  38.80633 37.04052 42.26464        1 Confirmed
6      cred_ten_year 38.18778  38.19954 35.61786 41.10785        1 Confirmed
7    amount_ten_year 37.15793  36.78392 34.18727 41.11690        1 Confirmed
8       age_ten_year 36.34803  36.42665 35.15741 37.48979        1 Confirmed
9       credit_score 34.86733  34.96885 31.92939 38.70913        1 Confirmed
10      bal_ten_year 34.69190  34.57109 33.07699 36.42140        1 Confirmed
11  estimated_salary 34.44135  34.42287 30.47675 38.42563        1 Confirmed
12     amount_credit 34.10562  34.03872 31.38059 37.15669        1 Confirmed
13         cred_prod 33.19357  33.10060 32.07205 35.25670        1 Confirmed
14 geography_Germany 31.53729  31.28621 29.57731 33.43443        1 Confirmed
15            tenure 30.14478  30.19744 27.83968 31.97109        1 Confirmed
16       tenure_year 29.79353  29.77035 27.55277 31.81791        1 Confirmed
17       amount_prod 29.49149  29.68004 27.39199 30.97616        1 Confirmed
18            amount 28.14608  28.03218 25.01640 32.38418        1 Confirmed
19         prod_m_cr 24.79515  24.76330 22.69695 26.57305        1 Confirmed
20       gender_Male 22.96577  22.69095 22.07488 24.49463        1 Confirmed
21     gender_Female 22.87308  22.92502 21.23916 24.94434        1 Confirmed
22         prod_t_cr 22.08248  22.18046 20.81962 22.90126        1 Confirmed
23  geography_France 18.53242  18.69797 16.66298 20.28926        1 Confirmed
24   geography_Spain 17.98637  17.96556 17.13462 18.80484        1 Confirmed
25       has_cr_card 12.85795  13.03965 11.54596 14.55708        1 Confirmed
features_boruta %>% 
  ggplot(aes(stats::reorder(features, desc(meanImp)), meanImp)) +
  geom_bar(stat = "identity", fill = "steelblue", col = "black") +
  labs(title = "Importance of Boruta Features", x = "features") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5, size = 18))
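Boruta confirms a feature by repeatedly comparing its importance against "shadow" features, shuffled copies of the real columns that keep the same distribution but carry no signal. A crude base-R illustration of that core idea, using absolute correlation with the target as a stand-in for the random-forest importance score Boruta actually uses:

```r
set.seed(42)
n <- 200
x_real   <- rnorm(n)                      # informative feature
y        <- x_real + rnorm(n, sd = 0.5)   # target depends on x_real
x_shadow <- sample(x_real)                # shuffled copy: same values, no signal

# crude "importance": absolute correlation with the target
imp_real   <- abs(cor(x_real, y))
imp_shadow <- abs(cor(x_shadow, y))

imp_real > imp_shadow   # an informative feature should beat its shadow
```

A feature that consistently beats the best shadow across iterations is marked Confirmed, which is why every feature in the table above carries `normHits = 1`.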

9 Machine Learning Modelling

9.1 Checking performance on balanced and imbalanced data

9.1.1 Imbalanced Data

  • I will create 2 models, aiming to evaluate their performance on balanced and imbalanced data.

9.1.2 Logistic Regression Model

# Creating model 
lr <- logistic_reg() %>%
  set_engine("glm") 

# Training model
lr_fit <- lr %>% fit(exited ~., data= train_data_imbalanced)

lr_wf <- workflow() %>% 
  add_model(lr) %>% 
  add_recipe(rec_imbalanced)

lr_res <- 
  last_fit(
    lr_wf,
    split = data_split,
    metrics = metric_set(accuracy, precision ,recall, f_meas , roc_auc ,mcc, kap)
  )

lr_result <- data.frame(t(collect_metrics(lr_res) %>% select(.metric, .estimate)))

rownames(lr_result) <- NULL

colnames(lr_result) <- lr_result[1,]

lr_result <- lr_result[-1, ]

lr_result <- data.frame(Model = "Logistic Regression - Imbalanced", lr_result)

lr_result %>% knitr::kable(caption = "Table Metrics Single - Imbalanced") %>% 
  kableExtra::kable_styling(full_width = F, bootstrap_options = c("hover", "condensed"), html_font = "Cambria")
Table Metrics Single - Imbalanced
Model accuracy precision recall f_meas mcc kap roc_auc
2 Logistic Regression - Imbalanced 0.8444222 0.8460292 0.9836683 0.9096718 0.4358823 0.3713792 0.8296658

9.1.3 Logistic Regression Model - Cross Validation

#lr_fit_cv <- lr_wf %>% 
#  fit_resamples(cv, metrics=metric_set(accuracy, precision ,recall, f_meas , roc_auc ,mcc, kap),
#                control = control_resamples(save_pred = T))

#saveRDS(lr_fit_cv, "Models/lr_fit_cv.rds")

lr_fit_cv <- readRDS("Models/lr_fit_cv.rds")

lr_result_cv <- data.frame(t(collect_metrics(lr_fit_cv) %>% select(.metric, mean)))

rownames(lr_result_cv) <- NULL

colnames(lr_result_cv) <- lr_result_cv[1,]

lr_result_cv <- lr_result_cv[-1, ]

lr_result_cv <- data.frame(Model = "Logistic Regression - Imbalanced", lr_result_cv)

lr_result_cv %>% knitr::kable(caption = "Table Metrics Cross Validation - Imbalanced") %>% 
  kableExtra::kable_styling(full_width = F, bootstrap_options = c("hover", "condensed"), html_font = "Cambria")
Table Metrics Cross Validation - Imbalanced
Model accuracy f_meas kap mcc precision recall roc_auc
2 Logistic Regression - Imbalanced 0.8315229 0.9023105 0.3140080 0.3738247 0.8384426 0.9769265 0.8158259

9.1.4 Logistic Regression Model - Confusion Matrix

 lr_fit_cv %>%
   unnest(.predictions) %>%
   conf_mat(exited, .pred_class) %>% 
  autoplot(type = "heatmap")
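The precision, recall, and F1 values reported above follow directly from a confusion matrix like this one. A base-R sketch with hypothetical counts (yardstick treats the first factor level, "no", as the positive class):

```r
# hypothetical counts: rows = predicted class, columns = true class
cm <- matrix(c(1550, 260,   # predicted "no"
                 50, 140),  # predicted "yes"
             nrow = 2, byrow = TRUE,
             dimnames = list(pred = c("no", "yes"), truth = c("no", "yes")))

tp <- cm["no", "no"]    # true positives for the "no" class
fp <- cm["no", "yes"]   # predicted "no", actually "yes"
fn <- cm["yes", "no"]   # predicted "yes", actually "no"

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f_meas    <- 2 * precision * recall / (precision + recall)

c(precision = precision, recall = recall, f_meas = f_meas)
```

Because "no" dominates the data, recall for "no" is high almost for free, which is part of why accuracy and recall look flattering on the imbalanced models.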

9.1.5 Random Forest Model

# Creating model 
rf <- rand_forest(trees = 500) %>%
  set_engine("ranger") %>% 
  set_mode("classification")

# Training model
#rf_fit <- rf %>% fit(exited ~., data= train_data_imbalanced)

#saveRDS(rf_fit, "Models/rf_fit.rds")

rf_fit <- readRDS("Models/rf_fit.rds")

rf_wf <- workflow() %>% 
  add_model(rf) %>% 
  add_recipe(rec_imbalanced)

# rf_res <- 
#   last_fit(
#     rf_wf,
#     split = data_split,
#     metrics = metric_set(accuracy, precision ,recall, f_meas , roc_auc ,mcc, kap)
#   )

#saveRDS(rf_res,"Models/rf_res.rds")

rf_res <- readRDS("Models/rf_res.rds")

rf_result <- data.frame(t(collect_metrics(rf_res) %>% select(.metric, .estimate)))

rownames(rf_result) <- NULL

colnames(rf_result) <- rf_result[1,]

rf_result <- rf_result[-1, ] 

rf_result <- data.frame(Model = "Random Forest - Imbalanced", rf_result)

rf_result %>% knitr::kable(caption = "Table Metrics Single - Imbalanced") %>% 
  kableExtra::kable_styling(full_width = F, bootstrap_options = c("hover", "condensed"), html_font = "Cambria")
Table Metrics Single - Imbalanced
Model accuracy precision recall f_meas mcc kap roc_auc
2 Random Forest - Imbalanced 0.8709355 0.8824541 0.9667085 0.9226619 0.5588921 0.5377695 0.8560678

9.1.6 Random Forest Model - Cross Validation

# rf_fit_cv <- rf_wf %>% 
#   fit_resamples(cv, metrics=metric_set(accuracy, precision ,recall, f_meas , roc_auc ,mcc, kap),
#                 control = control_resamples(save_pred = T))

#saveRDS(rf_fit_cv, "Models/rf_fit_cv.rds")

rf_fit_cv <- readRDS("Models/rf_fit_cv.rds")

rf_result_cv <- data.frame(t(collect_metrics(rf_fit_cv) %>% select(.metric, mean)))

rownames(rf_result_cv) <- NULL

colnames(rf_result_cv) <- rf_result_cv[1,]

rf_result_cv <- rf_result_cv[-1, ]

rf_result_cv <- data.frame(Model = "Random Forest - Imbalanced", rf_result_cv)

rf_result_cv %>% knitr::kable(caption = "Table Metrics Cross Validation - Imbalanced") %>% 
  kableExtra::kable_styling(full_width = F, bootstrap_options = c("hover", "condensed"), html_font = "Cambria")
Table Metrics Cross Validation - Imbalanced
Model accuracy f_meas kap mcc precision recall roc_auc
2 Random Forest - Imbalanced 0.8557690 0.9141315 0.4709430 0.4980182 0.8690588 0.9642139 0.8434486

9.1.7 Random Forest Model - Confusion Matrix

 rf_fit_cv %>%
   unnest(.predictions) %>%
   conf_mat(exited, .pred_class) %>% 
  autoplot(type = "heatmap")

9.2 Logistic Regression Model

9.2.1 Balanced Data

# Creating model 
lr_bal <- logistic_reg() %>%
  set_engine("glm") 

# Training model
lr_fit_bal <- lr_bal %>% fit(exited ~., data= train_data_balanced)

lr_wf_bal <- workflow() %>% 
  add_model(lr_bal) %>% 
  add_recipe(rec_balanced)

lr_res_bal <- 
  last_fit(
    lr_wf_bal,
    split = data_split,
    metrics = metric_set(accuracy, precision ,recall, f_meas , roc_auc ,mcc, kap)
  )

lr_result_bal <- data.frame(t(collect_metrics(lr_res_bal) %>% select(.metric, .estimate)))

rownames(lr_result_bal) <- NULL

colnames(lr_result_bal) <- lr_result_bal[1,]

lr_result_bal <- lr_result_bal[-1, ]

lr_result_bal <- data.frame(Model = "Logistic Regression - Balanced", lr_result_bal)

lr_result_bal %>% knitr::kable(caption = "Table Metrics Single - Balanced") %>% 
  kableExtra::kable_styling(full_width = F, bootstrap_options = c("hover", "condensed"), html_font = "Cambria")
Table Metrics Single - Balanced
Model accuracy precision recall f_meas mcc kap roc_auc
2 Logistic Regression - Balanced 0.8034017 0.9075459 0.8385678 0.8716944 0.4607166 0.4541640 0.8293016

9.2.2 Logistic Regression Model - Cross Validation

# lr_fit_cv_bal <- lr_wf_bal %>%
#  fit_resamples(cv, metrics=metric_set(accuracy, precision ,recall, f_meas , roc_auc ,mcc, kap),
#                control = control_resamples(save_pred = T))

#saveRDS(lr_fit_cv_bal, "Models/lr_fit_cv_bal.rds")

lr_fit_cv_bal <- readRDS("Models/lr_fit_cv_bal.rds")

lr_result_cv_bal <- data.frame(t(collect_metrics(lr_fit_cv_bal) %>% select(.metric, mean)))

rownames(lr_result_cv_bal) <- NULL

colnames(lr_result_cv_bal) <- lr_result_cv_bal[1,]

lr_result_cv_bal <- lr_result_cv_bal[-1, ]

lr_result_cv_bal <- data.frame(Model = "Logistic Regression - Balanced", lr_result_cv_bal)

lr_result_cv_bal %>% knitr::kable(caption = "Table Metrics Cross Validation - Balanced") %>% 
  kableExtra::kable_styling(full_width = F, bootstrap_options = c("hover", "condensed"), html_font = "Cambria")
Table Metrics Cross Validation - Balanced
Model accuracy f_meas kap mcc precision recall roc_auc
2 Logistic Regression - Balanced 0.7749071 0.8498889 0.4036957 0.4162602 0.9038824 0.8031720 0.8150445

9.2.3 Logistic Regression Model - Confusion Matrix

 lr_fit_cv_bal %>%
   unnest(.predictions) %>%
   conf_mat(exited, .pred_class) %>% 
  autoplot(type = "heatmap")

9.2.4 Random Forest Model

# Creating model 
rf_bal <- rand_forest(trees = 500) %>%
  set_engine("ranger") %>% 
  set_mode("classification")

# Trainnig model
#rf_fit_bal <- rf_bal %>% fit(exited ~., data= train_data_balanced)

#saveRDS(rf_fit_bal, "Models/rf_fit_bal.rds")

rf_fit_bal <- readRDS("Models/rf_fit_bal.rds")

rf_wf_bal <- workflow() %>% 
  add_model(rf_bal) %>% 
  add_recipe(rec_balanced)

# rf_res_bal <-
#   last_fit(
#     rf_wf_bal,
#     split = data_split,
#     metrics = metric_set(accuracy, precision ,recall, f_meas , roc_auc ,mcc, kap)
#   )

#saveRDS(rf_res_bal,"Models/rf_res_bal.rds")

rf_res_bal <- readRDS("Models/rf_res_bal.rds")

rf_result_bal <- data.frame(t(collect_metrics(rf_res_bal) %>% select(.metric, .estimate)))

rownames(rf_result_bal) <- NULL

colnames(rf_result_bal) <- rf_result_bal[1,]

rf_result_bal <- rf_result_bal[-1, ] 

rf_result_bal <- data.frame(Model = "Random forest - Balanced", rf_result_bal)

rf_result_bal %>% knitr::kable(caption = "Table Metrics Single - Balanced") %>% 
  kableExtra::kable_styling(full_width = F, bootstrap_options = c("hover", "condensed"), html_font = "Cambria")
Table Metrics Single - Balanced
Model accuracy precision recall f_meas mcc kap roc_auc
2 Random forest - Balanced 0.8444222 0.9000625 0.9051508 0.9025994 0.5163299 0.5162793 0.8532605

9.2.5 Random Forest Model - Cross Validation

# rf_fit_cv_bal <- rf_wf_bal %>%
#   fit_resamples(cv, metrics=metric_set(accuracy, precision ,recall, f_meas , roc_auc ,mcc, kap),
#                 control = control_resamples(save_pred = T))

#saveRDS(rf_fit_cv_bal, "Models/rf_fit_cv_bal.rds")

rf_fit_cv_bal <- readRDS("Models/rf_fit_cv_bal.rds")

rf_result_cv_bal <- data.frame(t(collect_metrics(rf_fit_cv_bal) %>% select(.metric, mean)))

rownames(rf_result_cv_bal) <- NULL

colnames(rf_result_cv_bal) <- rf_result_cv_bal[1,]

rf_result_cv_bal <- rf_result_cv_bal[-1, ]

rf_result_cv_bal <- data.frame(Model = "Random Forest - Balanced", rf_result_cv_bal)

rf_result_cv_bal %>% knitr::kable(caption = "Table Metrics Cross Validation - Balanced") %>% 
  kableExtra::kable_styling(full_width = F, bootstrap_options = c("hover", "condensed"), html_font = "Cambria")
Table Metrics Cross Validation - Balanced
Model accuracy f_meas kap mcc precision recall roc_auc
2 Random Forest - Balanced 0.8385223 0.8993591 0.4908712 0.4916184 0.8926209 0.9062952 0.8394568

9.2.6 Random Forest Model - Confusion Matrix

rf_fit_cv_bal %>%
   unnest(.predictions) %>%
   conf_mat(exited, .pred_class) %>% 
  autoplot(type = "heatmap")

9.2.7 Xgboost Model - Imbalanced

# Creating model 
xg <- boost_tree(trees = 500) %>%
  set_engine("xgboost") %>% 
  set_mode("classification")

# Training model
#xg_fit <- xg %>% fit(exited ~., data= train_data_imbalanced)

#saveRDS(xg_fit, "Models/xg_fit.rds")

xg_fit <- readRDS("Models/xg_fit.rds")

xg_wf <- workflow() %>% 
  add_model(xg) %>% 
  add_recipe(rec_imbalanced)

# xg_res <-
#   last_fit(
#     xg_wf,
#     split = data_split,
#     metrics = metric_set(accuracy, precision ,recall, f_meas , roc_auc ,mcc, kap)
#   )

#saveRDS(xg_res,"Models/xg_res.rds")

xg_res <- readRDS("Models/xg_res.rds")

xg_result <- data.frame(t(collect_metrics(xg_res) %>% select(.metric, .estimate)))

rownames(xg_result) <- NULL

colnames(xg_result) <- xg_result[1,]

xg_result <- xg_result[-1, ] 

xg_result <- data.frame(Model = "Xgboost Model - Imbalanced", xg_result)

xg_result %>% knitr::kable(caption = "Table Metrics Single - Imbalanced") %>% 
  kableExtra::kable_styling(full_width = F, bootstrap_options = c("hover", "condensed"), html_font = "Cambria")
Table Metrics Single - Imbalanced
Model accuracy precision recall f_meas mcc kap roc_auc
2 Xgboost Model - Imbalanced 0.8504252 0.8827709 0.9365578 0.9088692 0.5006732 0.4938782 0.8475516

9.3 Xgboost Model - Cross Validation - Imbalanced

# xg_fit_cv <- xg_wf %>%
#   fit_resamples(cv, metrics=metric_set(accuracy, precision ,recall, f_meas , roc_auc ,mcc, kap),
#                 control = control_resamples(save_pred = T))

#saveRDS(xg_fit_cv, "Models/xg_fit_cv.rds")

xg_fit_cv <- readRDS("Models/xg_fit_cv.rds")

xg_result_cv <- data.frame(t(collect_metrics(xg_fit_cv) %>% select(.metric, mean)))

rownames(xg_result_cv) <- NULL

colnames(xg_result_cv) <- xg_result_cv[1,]

xg_result_cv <- xg_result_cv[-1, ]

xg_result_cv <- data.frame(Model = "Xgboost Model - Imbalanced", xg_result_cv)

xg_result_cv %>% knitr::kable(caption = "Table Metrics Cross Validation - Imbalanced") %>% 
  kableExtra::kable_styling(full_width = F, bootstrap_options = c("hover", "condensed"), html_font = "Cambria")
Table Metrics Cross Validation - Imbalanced
Model accuracy f_meas kap mcc precision recall roc_auc
2 Xgboost Model - Imbalanced 0.8432714 0.9053775 0.4529159 0.4653514 0.8716130 0.9419253 0.8220846

9.4 Xgboost Model - Confusion Matrix - Imbalanced

 xg_fit_cv %>%
   unnest(.predictions) %>%
   conf_mat(exited, .pred_class) %>% 
  autoplot(type = "heatmap")

9.4.1 Xgboost Model - Balanced

# Creating model 
xg_bal <- boost_tree(trees = 500) %>%
  set_engine("xgboost") %>% 
  set_mode("classification")

# Training model
#xg_fit_bal <- xg_bal %>% fit(exited ~., data= train_data_balanced)

#saveRDS(xg_fit_bal, "Models/xg_fit_bal.rds")

xg_fit_bal <- readRDS("Models/xg_fit_bal.rds")

xg_wf_bal <- workflow() %>% 
  add_model(xg_bal) %>% 
  add_recipe(rec_balanced)

# xg_res_bal <-
#   last_fit(
#     xg_wf_bal,
#     split = data_split,
#     metrics = metric_set(accuracy, precision ,recall, f_meas , roc_auc ,mcc, kap)
#   )

#saveRDS(xg_res_bal,"Models/xg_res_bal.rds")

xg_res_bal <- readRDS("Models/xg_res_bal.rds")

xg_result_bal <- data.frame(t(collect_metrics(xg_res_bal) %>% select(.metric, .estimate)))

rownames(xg_result_bal) <- NULL

colnames(xg_result_bal) <- xg_result_bal[1,]

xg_result_bal <- xg_result_bal[-1, ] 

xg_result_bal <- data.frame(Model = "Xgboost Model - Balanced", xg_result_bal)

xg_result_bal %>% knitr::kable(caption = "Table Metrics Single - Balanced") %>% 
  kableExtra::kable_styling(full_width = F, bootstrap_options = c("hover", "condensed"), html_font = "Cambria")
Table Metrics Single - Balanced
Model accuracy precision recall f_meas mcc kap roc_auc
2 Xgboost Model - Balanced 0.8524262 0.8908981 0.9283920 0.9092587 0.5183932 0.5152418 0.8538809

9.5 Xgboost Model - Cross Validation - Balanced

# xg_fit_cv_bal <- xg_wf_bal %>%
#   fit_resamples(cv, metrics=metric_set(accuracy, precision ,recall, f_meas , roc_auc ,mcc, kap),
#                 control = control_resamples(save_pred = T))

#saveRDS(xg_fit_cv_bal, "Models/xg_fit_cv_bal.rds")

xg_fit_cv_bal <- readRDS("Models/xg_fit_cv_bal.rds")

xg_result_cv_bal <- data.frame(t(collect_metrics(xg_fit_cv_bal) %>% select(.metric, mean)))

rownames(xg_result_cv_bal) <- NULL

colnames(xg_result_cv_bal) <- xg_result_cv_bal[1,]

xg_result_cv_bal <- xg_result_cv_bal[-1, ]

xg_result_cv_bal <- data.frame(Model = "Xgboost Model - Balanced", xg_result_cv_bal)

xg_result_cv_bal %>% knitr::kable(caption = "Table Metrics Cross Validation - Balanced") %>% 
  kableExtra::kable_styling(full_width = F, bootstrap_options = c("hover", "condensed"), html_font = "Cambria")
Table Metrics Cross Validation - Balanced
Model accuracy f_meas kap mcc precision recall roc_auc
2 Xgboost Model - Balanced 0.8237736 0.8893934 0.4506173 0.4555440 0.8870763 0.8932634 0.8216237

9.6 Xgboost Model - Confusion Matrix - Balanced

 xg_fit_cv_bal %>%
   unnest(.predictions) %>%
   conf_mat(exited, .pred_class) %>% 
  autoplot(type = "heatmap")

9.6.1 Comparing model performance on imbalanced and balanced data

bind_rows(lr_result_cv, lr_result_cv_bal, rf_result_cv, rf_result_cv_bal, xg_result_cv, xg_result_cv_bal) %>% 
  knitr::kable(caption = "Performance Comparison Table for Models - Imbalanced and Balanced") %>% 
  kableExtra::kable_styling(full_width = F, bootstrap_options = c("hover", "condensed"), html_font = "Cambria")
Performance Comparison Table for Models - Imbalanced and Balanced
Model accuracy f_meas kap mcc precision recall roc_auc
Logistic Regression - Imbalanced 0.8315229 0.9023105 0.3140080 0.3738247 0.8384426 0.9769265 0.8158259
Logistic Regression - Balanced 0.7749071 0.8498889 0.4036957 0.4162602 0.9038824 0.8031720 0.8150445
Random Forest - Imbalanced 0.8557690 0.9141315 0.4709430 0.4980182 0.8690588 0.9642139 0.8434486
Random Forest - Balanced 0.8385223 0.8993591 0.4908712 0.4916184 0.8926209 0.9062952 0.8394568
Xgboost Model - Imbalanced 0.8432714 0.9053775 0.4529159 0.4653514 0.8716130 0.9419253 0.8220846
Xgboost Model - Balanced 0.8237736 0.8893934 0.4506173 0.4555440 0.8870763 0.8932634 0.8216237

Judging by accuracy alone, the models trained on imbalanced data appear to perform better, but this is misleading. The confusion matrices show that the models are mostly correct when the class is “no”, simply because that class makes up roughly 80% of the database. Since accuracy = number of correct predictions / total cases, imbalanced data gives the model a strong bias toward the majority class. To choose the best model I will instead use Cohen's Kappa, a statistical metric for measuring model performance on qualitative (categorical) outcomes.
It is a more useful measure for problems with class imbalance: with a 70-30 split between classes 0 and 1, for example, predicting class 0 for every instance already achieves 70% accuracy.
By this criterion the balanced data set is the better choice. Even though this outcome is expected, I illustrated it to make clear why step_smote() is used in the recipes to balance the data.

I will use the Random Forest model for this problem, as it had the best performance.
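The accuracy-versus-Kappa argument can be made concrete with a toy 70-30 split where the "model" always predicts the majority class:

```r
# toy data: 70% class "0", 30% class "1"; the model always predicts "0"
truth <- factor(c(rep("0", 70), rep("1", 30)))
pred  <- factor(rep("0", 100), levels = levels(truth))

accuracy <- mean(pred == truth)   # 0.7 -- looks respectable

# Cohen's Kappa = (observed agreement - chance agreement) / (1 - chance agreement)
p_o   <- mean(pred == truth)
p_e   <- sum(prop.table(table(truth)) * prop.table(table(pred)))
kappa <- (p_o - p_e) / (1 - p_e)  # 0 -- no skill beyond chance
```

Accuracy rewards the degenerate majority-class predictor; Kappa correctly scores it at zero, which is why it is the deciding metric here.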

10 Hyperparameter Fine-Tuning

# Specifying model to be tuned.
rf_spec <- rand_forest(
  mtry = tune(),
  trees = 3000,
  min_n = tune()
) %>%
  set_mode("classification") %>%
  set_engine("ranger")


rf_wf_spec <- workflow() %>%
  add_recipe(rec_balanced) %>%
  add_model(rf_spec)
# set.seed(345)
# 
# doParallel::registerDoParallel()
# 
# tune_result <- tune_grid(
#   
#   rf_wf_spec,
#   resamples = cv,
#   grid = 10, metrics=metric_set(accuracy, precision ,recall, f_meas , roc_auc ,mcc, kap),
#   control = control_grid(save_pred = TRUE)
# )

#saveRDS(tune_result, "hyperparameters/tune_result.rds")

tune_result <- readRDS("hyperparameters/tune_result.rds")
tune_result %>%
  collect_metrics() %>%
  filter(.metric == "kap") %>%
  select(mean, min_n, mtry) %>%
  pivot_longer(min_n:mtry,
    values_to = "value",
    names_to = "parameter"
  ) %>%
  ggplot(aes(value, mean, color = parameter)) +
  geom_point(show.legend = FALSE) +
  facet_wrap(~parameter, scales = "free_x") +
  labs(x = NULL, y = "Kappa")

best_kappa <- select_best(tune_result, "kap")

best_kappa
# A tibble: 1 x 3
   mtry min_n .config              
  <int> <int> <chr>                
1     6    30 Preprocessor1_Model01
final_rf <- finalize_model(
  rf_spec,
  best_kappa
)


final_rf
Random Forest Model Specification (classification)

Main Arguments:
  mtry = 6
  trees = 3000
  min_n = 30

Computational engine: ranger 
final_wf <- workflow() %>%
  add_recipe(rec_balanced) %>%
  add_model(final_rf)

# final_res <- final_wf %>%
#   last_fit(data_split, metrics=metric_set(accuracy, precision ,recall, f_meas , roc_auc ,mcc, kap))

#saveRDS(final_res, "Models/final_res.rds")

final_res <- readRDS("Models/final_res.rds")

final_res <- data.frame(t(collect_metrics(final_res) %>% select(.metric, .estimate)))

rownames(final_res) <- NULL

colnames(final_res) <- final_res[1,]

final_res <- final_res[-1, ]

final_res <- data.frame(Model = "Random Forest - Final Model", final_res)

final_res %>% knitr::kable(caption = "Table Metrics Random Forest - Final Model") %>%
  kableExtra::kable_styling(full_width = F, bootstrap_options = c("hover", "condensed"), html_font = "Cambria")
Table Metrics Random Forest - Final Model
Model accuracy precision recall f_meas mcc kap roc_auc
2 Random Forest - Final Model 0.8419210 0.9048223 0.8957286 0.9002525 0.5197302 0.5195764 0.8568457